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1 Abstract 

This paper deals with the Gaussian process based approximation of a code 
which can be run at different levels of accuracy. This method, which is a 
particular case of co-kriging, allows us to improve a surrogate model of a 
complex computer code using fast approximations of it. In particular, we 
focus on the case of a large number of code levels on the one hand and on a 
Bayesian approach when we have two levels on the other hand. 

The main results of this paper are a new approach to estimate the model 
parameters which provides a closed form expression for an important param- 
eter of the model (the scale factor) , a reduction of the numerical complexity 
by simplifying the covariance matrix inversion, and a new Bayesian mod- 
elling that gives an explicit representation of the joint distribution of the 
parameters and that is not computationally expensive. 

A thermodynamic example is used to illustrate the comparison between 2- 
level and 3- level co-kriging. Keywords: surrogate models, co-kriging, multi- 
fidelity computer experiment, Bayesian analysis. 

2 Introduction 

Large computer codes are widely used in science and engineering to study 
physical systems since real experiments are often costly and sometimes im- 
possible. Nevertheless, simulations can sometimes be costly and time-consuming 
as well. In this case, conception based on an exhaustive exploration of the 
input space of the code is generally impossible under reasonable time con- 
straints. Therefore, a mathematical approximation of the output of the code 
- also called surrogate or metamodel - is often built with a few simulations 
to represent the real system. 
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Gaussian Process regression is a particular class of surrogate which makes 
the assumption that prior beliefs about the code can be modelled by a Gaus- 
sian Process. We focus here on this metamodel and on its extension to mul- 
tiple response models. The reader is refered to |14| and [13j for further detail 
about Gaussian Process models. 

Actually, a computer code can often be run at different levels of com- 
plexity and a hierarchy of levels of code can hence be obtained. The aim of 
this paper is to study the use of several levels of a code to predict the output 
of a costly computer code. 

A first metamodel for multi-level computer codes was built by Kennedy 
and O'Hagan [7] using a spatially stationary correlation structure. This 
multi-stage model is a particular case of co-kriging which is a well known 
geostatistical method. Then, Forrester et al. [3] went into more detail about 
the estimation of the model parameters. Furthermore, Forrester et al. pre- 
sented the use of co-kriging for multi-fidelity optimization based on the EGO 
(Efficient Global Optimization) algorithm created by Jones et al. [9]. A 
Bayesian approach was also proposed by Qian and Wu |12| which is com- 
putationally expensive and does not provide explicit formulas for the joint 
distribution of the parameters. 

This paper presents a new approach to estimate the parameters of the 
model which is effective in the case of non-spatial stationarity and when 
many levels of codes are available. In particular, it provides a closed form 
expression for the estimation of the scale factor which is new and of great 
practical interest. Furthermore, this approach allows us to consider prior in- 
formation in the estimation of the parameters. We also address the problem 
of the inversion of the co-kriging covariance matrix when the number of levels 
is large. A solution to this problem is provided which shows that the inverse 
can be easily calculated. Finally, it is known that with a non-Bayesian ap- 
proach, the variance of the predictive distribution may be underestimated [7J. 
This paper suggests a Bayesian modelling different from the one presented 
in |12| which provides an explicit representation of the joint distribution for 
the parameters and avoids prohibitive implementation. 

3 Building a surrogate model based on a hierarchy 
of s levels of code 

Let us assume that we have s levels of code z±(x), . . . , z s (x), x £ M. d , d > 0. 
For all £ = 1, . . . , s the t th scalar output Zt(x) is modelled by Zt(x) = Zt(x, oj) 
where Zt(x,uj),uj € ft is a realization of the Gaussian process Zt(x). We 
will introduce below a consistent set of hypotheses so that the joint process 
(Zt(x)) x ^d t=i ... s is Gaussian given a certain set of parameters. Kenney and 
O'Hagan [7] suggest an autoregressive model to build a metamodel based on 
a multi-level computer code. Hence, we have a hierarchy of s levels of code 
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- from the less accurate to the most accurate - and for each level, the condi- 
tional distribution of the Gaussian process Zt{x) knowing Z\(x), . . . , Zt-\(x) 
is entirely determined by Zt—\{x). Let us introduce here the mathematical 
formalism that we will use in this paper. 

Q C M. d is a compact subset of M. d called the input space or the do- 
main of interest. For t = I, ... ,8, Dt = {x\ , . . . , zi' } is the experimen- 
tal design set at level t containing m points in Q. Let Zt = Zt(Dt) = 
(Zt(xf^), . . . , Zt(xn})) T be the random Gaussian vector containing the val- 
ues of Zt(x) for x E Dt where T stands for the transpose. Let Z = 
(Zf, . . . , Zj) T be the Gaussian random vector containing the values of the 
processes {Zt{x)~)t—x a at the points of the design sets [Dt)t=\ s - We 
assume here that the code output is observed without measurement er- 
ror. The column vector of responses is written z = (zf , . . . , zJ) T , where 

Zt = (zt(xi), . . . , Zt(xn} )) T is the output vector for the level t. 

If we consider Z s (x), the Gaussian process modelling the most accurate 
code, we want to determine the predictive distribution of Z s (xq), xq £ Q 
given Z = z, i.e. the following conditional distribution: [Z s {xq)\Z = z]. 

We assume the Markow property introduced in [7]: 

Cov(Z t (x),Zt- 1 (x')\Z t -i(x)) = Vx^x' (1) 

This means that if Zt—\{x) is known, then nothing more can be learnt about 
Zf(x) from any other run of the cheaper code Zt—i(x') for x' ^ x. This 
assumption leads to the following autoregressive model: 

Z t {x) = p t -i{x)Z t -i{x) + 5 t {x) t = 2,...,s (2) 

where 6t(x) is a Gaussian process independent of Zt-i(x), . . . , Z\{x) and 
pt-\{x) represents a scale factor between Zt(x) and Zt—\{x). We assume that 
pt-i{x) = f Pt _ 1 (x) T P Pt _ 1 , t = 2,...,s, where fp^x) = {f l ptl (x), f%Z\{x)) T 
is a vector of qt-i regression functions - generally including the constant func- 
tion : xeQ-tl - and j3 pt _ 1 <E M 9 '- 1 . 

Conditioning on parameters at, fit an d Ot, St(x) is assumed to be a Gaus- 
sian process with mean ft{x) T fit, where ft(x) is a ^-dimensional vector of 
regression functions, and with a covariance function of the form ct(x,x') = 
cov(5t(x),5t(x')) = afrt{x — x';6t), where a\ is the variance of the Gaus- 
sian process and Qt are the hyper parameters of the correlation function rt- 
Moreover, conditioning on parameters a\, (3\ and 9\, the simplest code Z\{x) 
is modelled as a Gaussian process with mean f\(x) T (3\ and with covariance 
function c\{x,x') = o\ri(x — x'\0\). With this consistent set of hypothe- 
ses, the joint process (Z 1 (x), . . . , Z t (x)) xe Q tt =i,..., s given a 2 = (cr 2 )i=i,...,t, 
9 = (9i)i=i,... t t, P = (As)i=i,...,t and (3 p = (P Pi _ 1 ) i =2,...,t, is Gaussian with 
mean: 

E[Z t (x)\a 2 ,9,p,p p ] = h' t (x) T p (3) 
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K(xf = [ \ X\ Pl (x)) fT{x),[\{p l {x)y^(x),...,p t ^(x)fl l (x)jJ{x) 

(4 



\ \i=l / \i=2 

and covariance: 



coy(Z t (x),Z t (x')\a 2 ,9,P,P p ) = J>? (il^) J ^{x-x'-Oj) (5) 

j'=i v=j / 

For each level t = 2, . . . , s, the experimental design Dt is assumed to be 
such that Dt Q Dt-i- Note that this assumption is not necessary but allows 
us to have closed form expressions for the parameter estimation formulas. 
Furthermore, we denote by Rt(D k , D{) the correlation matrix between points 
in Dk and D[, 1 < k,l < s. Rt(D k , D{) is a (rife x m) matrix with (i,j) entry 
given by: 

[Rt(D k , = r t {xf } - ; t ) 1 < i < n k 1 < j < n t 

We will use the notation: Rt(Dk) = Rt{Dk, Dk). 

[7] present the case where Vt € [2, s], pt~i(x) = pt~\ is constant. Here, 
we will consider the general model presented in equations ^j. We will also 
propose a new approach to estimate the coefficients P pt _ 1 )t=2,...,s based 
on a Bayesian approach, which allows us to get information about their 
uncertainties. In the following section, we describe the case of 2 levels of 
code where the scaling coefficient p is constant and then we will extend it for 
s levels in Section [71 The general case in which p depends on x is addressed 
in Appendix 1X1 



4 Building a model with 2 levels of code 

Let us assume that we have 2 levels of code zi {x) and z\{x). From the 
previous section we assume that: 

f Z 2 (x) = pZi{x) + S(x), x e Q ( ) 

\ (Z 1 (x)) xeQ ±(S(x)) xeQ W 

The goal of this section is to build a surrogate model for Z 2 {x) given the 
observations Z = z with an uncertainty quantification. The strategy is the 
following one. In Subsection 14. 1 1 we describe the statistical distribution of the 
output Z2(xq) at a new point xq given the parameters (/3i, /?2, p), (erf, ct|) an d 
(61,62) and the observations z. In Subsection 14.21 we describe the Bayesian 
estimation of the parameters (j3\,f32,p) and (a"f,cr|) given the observations. 
As pointed out at the end of Subsection 14.21 the hyper-parameters (61,62) 
are estimated using a concentrated restricted log-likelihood method. 
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4.1 Conditional distribution of the output 



For a point xq G Q we determine in this subsection the distribution of 
[Z 2 (xq)\Z = z, (/3i , (3 2 , p) , (o - !^!)' (^1)^2)]- Standard results for normal dis- 
tributions give that: 

[Z a (x ) |<Z = z, , /3 2 , p), (a? , a 2 2 ) , (0i , 2 )] ~ M{m Z2 (x ) , s\ (x )) (7) 
with mean function: 



m Z2 (x) = /i'(x) T /3 + - HP) 



and variance: 



(8) 



(9) 



where we have denoted /3 
by: 



/5i 



( /i T (4 x) ) o \ 
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and where H is defined 



/i T (x«) 

p/i t (4 2) ) / 2 T (4 2) ) 








pF!{D 2 ) 


^2(^2) 



with the notation Fi{Dj) = 

{Pfl(x),fl(x)) and: 

t{x) T =Cov(Z 2 {x),Z) 



. Furthermore, we have h'(x) 



(pa 2 R 1 ({x},D 1 ),p 2 a 2 R 1 ({x},D 2 ) + a 2 R 2 ({x},D 2 )) 

Zi 



(10) 



The covariance matrix V of the Gaussian vector Z 



V 



pa 2 R 1 (D 1 ,D 2 



pafR^D^Dt) p'a(R 1 (D 2 ) + aiR 2 (D 2 



can be written 



4.2 Bayesian estimation of the parameters with 2 levels of 
code 

In this Subsection, we describe the estimation of the parameters ((3\,/3 2 ,p, a 2 , 
cr 2 ,0i,0 2 ) for the 2-level model given the observations Z = z. Due to the 
conditional independence between Z\{x) and S(x), it is possible to estimate 
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separately the parameters (f3\,cr\,Q\) and (P2, p, cr 2 , #2)- We first describe 
the estimations of (/3i,af) given 9\ and (/32,0"2,/?) given 62 , which can be 
obtained in closed forms. We then describe how to estimate 6\ and 62- 

Firstly, we consider the parameters a\, 9\). We choose the following 
non-informative prior distributions corresponding to the "Jeffreys priors" [8|: 

pO8i|of,0i)a:l p{a\)^\ (12) 

Considering the probability density function of [Z\\Pi, a\ , 9\\ and the Bayes 
formula, the posterior distribution of [/3i|zi> 0\, ®i\ is : 

\fr\ Zl AM ([FfRiiD^F^lFlR^D^Z!], [if - F^ 

(13) 

Then, using the Bayes formula, we obtain that the posterior distribution of 



[a 2 1 \z 1 ,d 1 ]^ig( aa>i ,9l) (14) 



[<rf|zi,0i] is: 

Qi 
2 

where TQ{a, Q) stands for the inverse gamma distribution with density func- 
tion: 

Pa,Q(x) = r(a) xa+ l 1«>0 
and the parameters are given by: 

« CT ?K = !!l T Zi Qi =(z l -F 1 (3 l ) T R 1 (D 1 y 1 (z 1 -Fj 1 ) (15) 

with $ x = E^Zu^M = [F^R 1 (D 1 )- 1 F 1 ]- 1 [F[R 1 (D 1 )~ 1 z 1 ]. 

Bayesian estimation of parameters with non-informative "Jeffreys priors " 
[5] gives the same results as maximum likelihood estimation for the param- 
eter Pi. For the parameter af, the estimation given by 2 J°^ 1 — is identical 

to the one obtained with the restricted maximum likelihood method. This 
method was introduced by Patterson and Thompson |11] in order to reduce 
the bias of the maximum likelihood estimator. 

Secondly, let us consider the set of parameters (02, p, o\, #2)- 111 order 
to have closed form formulas for the estimation of (/J2 ,p), we estimate them 
together. The idea to carry out a joint estimation is proposed for the first 
time in this paper and we believe it is important. Indeed, if the cheaper 
code is perfectly known, it can be considered as a regression function and 
so p will be a regression parameter. In this case, it is clear that a separated 
estimation of P2 and p cannot be optimal. 

Using similar Jeffrey prior distributions as in (fT2j) and the same methodology 
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as for the estimation of (/5i, af), we find that: 



[{p,h)\zi,z 2 ,al,e 2 } ~ J\f P2 - 
and: 



i ( [F T R 2 {D 2 )^F\-'[F T R 2 {D 2 )-^Fl [F T ^f^F] 



(16) 



[a 2 |z 2 ,2i,6» 2 ] ~Xg(a CT 2| n2 ,— ) (17) 

where: 



a_2 

°2 



«2 - P2 



n 2 



Q 2 ={z 2 -FX) T R 2 {D 2 )-\z 2 -F\) (18) 



with A = E[(p,f3 2 )\ Zl ,z 2 ,al9 2 } = [F T R 2 (D 2 )~ l F]~ 1 [F T R 2 (D 2 )~~ 1 z 2 ]. The 
design matrix F is such that F = [pz\{D 2 ) F 2 \. Furthermore, the estima- 
tion of a\ given by 2 ® 2 — is the same as the restricted maximum likelihood 
one. 

The hyper-parameters 9\ and 6 2 are found by minimizing the opposite 
of the concentrated restricted log- likelihoods: 

log (|det CRiCDi)) I) + (ni - Pi)log(a\ 2 ) (19) 

log (|det (R 2 (D 2 )) |) + (n 2 -p 2 - l)log(cf 2 2 ) (20) 

These minimizations problems must be numerically solved with a global 
optimization method. We use an evolutionary method coupled with a BFGS 
algorithm. The drawback of the maximum likelihood estimation (see |10] ) is 
that, contrarily to Bayes estimation, we do not have any information about 
the variance of the estimator. Nevertheless, Bayes estimation of the hyper 
parameters Q\ and 6 2 are prohibitive and as noted in |14j the choice of the 
prior distribution is non trivial. Therefore, in this paper, we will always 
estimate these parameters with a concentrated restricted likelihood method. 



5 Bayesian prediction for a code with 2 levels 

The aim of a Bayesian prediction is to provide a predictive distribution for 
Z s (x) integrating the posterior distributions of the parameters and hence 
taking into account their uncertainty. 

A Bayesian prediction for a code with s = 2 levels was suggested in |12| . 
Nevertheless, we propose here a new Bayesian approach with some significant 
differences. First, we assume that the adjustment coefficient is a regression 
function whereas Qian and Wu |12j model it with a Gaussian process. Sec- 
ondly, we use different prior distributions for the parameter estimation. More 
specifically, according to the Bayesian estimation of parameters previously 
presented, we use a joint prior distribution for (/3 2 ,p) conditioned by a\ 
whereas in |12| they use separated prior distributions with p not conditioned 
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by a\. Then, we use a hierarchy between the different parameters. At the 
lowest level is the regressor parameter /3. At the second level is the variance 
parameter a 2 which controls the distribution of the parameter j3. At the top 
level is the parameter 9 which controls the distribution of the parameters at 
the bottom levels. It is common to use a hierarchical specification of mod- 
els for Bayesian prediction as presented in |13| . This strategy will allow us 
to obtain explicit formulas for the joint distribution of the parameters and 
above all, to reduce dramatically the cost of the numerical implementation 
of the complete Bayesian prediction. 

We will also present the case in which we do not have any prior informa- 
tion about the parameters. As described in the previous section, the hyper 
parameter 6 is estimated by minimizing the opposite of the concentrated 
restricted log-likelihood and it is assumed to be fixed to this estimated value 
from now on. 

5.1 Prior distributions and Bayesian estimation of the pa- 
rameters 

Many choices of priors can be made for the Bayesian modelling. Here we 
study the two following cases: 

(I) Priors for each parameter are informative. 

(II) Priors for each parameter are non-informative. 

For the non-informative case (II), we use the improper distributions corre- 
sponding to the "Jeffreys prior" and then the posterior distributions are given 
in Section 14.21 Note that non-informative distributions are used when we do 
not have prior knowledge. For the informative case (I), we will consider the 
following prior distributions: 

~ AU&i,^), [{p,h)\z u al]^M 1+P2 (b x =(l P ) ,<>%V X = ° 

[erf] ~ lQ(a 1 ,j 1 ), [crj\zi] ~X^(a 2 ,7 2 ) 

where b\ £ MP 1 , b\ 6 M 1+P2 , V\ is a (pi x pi) diagonal matrix, V\ is a 
((1 + p 2 ) x (1 +P2)) diagonal matrix and «i, 71, «2 ; 72 > 0. The forms of the 
priors are chosen in order to be able to get closed form expressions for the 
posterior distributions. Note that there are enough free parameters in the 
priors to allow the user to prescribe their means and variances. From the 
previous prior definitions, the posterior distributions of the parameters are: 

1 21 , of ] ~ JV P1 {A\ v\ , Aj ) l(p,(3 2 )\z u z 2 ,a 2 2 ] ~ M P2+1 (A^^A^) (21) 



where: 

(22) 

- = (n) * S [F T^lpi Z2] 



'2 V "2 



and F = [pzi{D 2 ) F 2 \. Furthermore, we have: 



(23) 



[a\\zi] ~ig( a f llni ,Qj-), [al\z 2 , Zl ]~ZQ{af n \% (24) 



where: 



n i = { 71 + (61 - M T (Vi + [if ^(DOFi]- 1 )" 1 ^! - Pi) + Q\ i = (I) 

V< 1 ^^r'Pi) - Ri 1 (D 1 )F 1 (F[R^ 1 (D 1 )F 1 )- 1 lfR^(D 1 )]z 1 i = (II) 

n2 = / 72 + (&a - A) T (F A + [F T J R 2 - 1 (L> 2 ) J P]- 1 )- 1 (6, - A) + Q| i = (I) 

^ 1 z 2 W(A>) " R 2 l {D 2 )F{F T R^{D 2 )F)-^F T R^{D 2 )]z 2 i = (II) 

$1 = (F[R^ 1 (D 1 )F 1 y 1 F[R^ 1 (D 1 )z 1 A = (F T R^ 1 (D 2 )F)- 1 F T R^ 1 (D 2 )z 2 

a\\nx_\ ^ + "1 i=(I) f + a 2 i = (I) 



Mixing of informative and non-informative priors are of course possible 
and easy to implement. As we will discuss in Subsection 15.41 and see in the 
examples of Section [6j the use of informative priors has minor impact on the 
mean estimation but may have a strong impact on variance estimation. 

5.2 Predictive distributions when (3 2 ,p,af and a\ are known 

As a preliminary step towards the Bayesian prediction carried out in the next 
subsection, we give here Bayesian prediction in the form of closed form ex- 
pressions when the parameters /3 2 , p, a\ and a\ are known. The conditional 
distribution of [Z 2 (x)\Z = z, /3 2 , p,af,a 2 ] is given by: 

[Z 2 (x)\Z = z,(3 2 ,p,ala 2 2 ] ~ N {p t (x),o 2 {x)) (25) 

where: 

<r!(x) = s 2 Z2 {x) + k 1 A}kl 



and Aj and v\ are defined by ()22p . Note that the estimated variance is 
augmented by the term k\A\k\ which quantifies the uncertainty due to the 
estimation of fa. k\ is a (1 X p\) vector composed of the p\ first elements of 
the (1 X p\,l x p 2 ) vector k = (k\,k 2 ) = h'(x) T - t(x) T V~ l H. H is given 
by ()4.ip . The existence of closed form formulas is important as it will allow 
for a fast numerical implementation. 

5.3 Bayesian prediction 

Before performing the Bayesian prediction we note that - thanks to the 
explicit joint prior distribution for fa and p, the independence hypotheses 
and the hierarchical specification of the parameters - conditioning on 6, we 
have an explicit formula for the following joint density: 

p{fa, fa, P, o\,a\\z\,z 2 ) = p(fa\<?\ , zi)p(fa,p\o%, zi, z 2 )p(al\z 1 )p(a 2 \z 1 , z 2 ) 

(26) 

This explicit joint density is an original result which contrasts with |12| 
and which allows us to avoid prohibitive implementation for the Bayesian 
analysis. 

First, we consider the predictive distribution with o\ and a\ known. Con- 
sidering the conditional independence assumption between (5(x)) X £q and 
(Z\(x)) x& q, the probability density function of [Z 2 (x)\Z = z,a\,a 2 ] can be 
deduced from the following integral: 

p{z 2 {x)\zx,z 2 ,a\,a\) = / p{z 2 {x)\z\, z 2 , fa, p, of, c4)p(p, fa\zi, z 2 , a\) dpdfa 

JIR 1 +P2 

(27) 

where p{z 2 {x)\z\, z 2 , fa, p,af,a 2 ) is given by (J7J. This integral has to be 
numerically evaluated. Since [p, fa\z\, z 2 , a 2 ] has a known normal distribu- 
tion given by (|21|) . we here use a crude Monte-Carlo algorithm when the 
dimension of fa and p is high, or a trapezoidal quadrature method when it 
is low. 

Then, we infer from the parameters g\ and a 2 . Due to the indepen- 
dence between (<5(x)) xe g and [Zi{x)) x ^q, the probability density function 
of \Z 2 {x)\Z = z] is: 

p(z 2 (x)\zi,z 2 ) = / p(z 2 {x)\z x ,z 2 ,o\,o\)p(o\\z\)p(o\\z\,z 2 )do\do\ (28) 

where p{a\\z\) and p{a 2 \z\, z 2 ) are given by ()24p . This integral has also to 
be numerically evaluated. Since we have a double integration, a quadra- 
ture method will be efficient. We use here a trapezoidal numerical inte- 
gration, defining the region of integration \o\ in j , °i aup ] x [ a 2 inf > a 2 aup \ fr° m 
the equation (f24"|) and such that p(o\ \z\), p(af aup \zi) p(& 2 . nf \zi, z 2 ) and 
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p(u2 su \z\, z-i) are close to 0. This region essentially contains the support 
of the function. Furthermore, we create a non-uniform integration grid dis- 
tributed with a geometric progression. 

Finally °p{zi{x) \ z\, £2) is a predictive density function integrating the pos- 
terior distribution of parameters (/?2, p, Pi, o\, erf)- We nence have a predic- 
tive distribution taking into account the uncertainties due to the parameter 
estimations. 

5.4 Discussion about the numerical evaluations of the inte- 
grals 

We saw in the previous section that we can obtain an analytical prediction 
when P2, P, f 1 and a\ are known. From this analytical formula, we can have 
a Bayesian prediction with only two nested integrations. One of them can 
be approximated with a quadrature or a crude Monte Carlo method, which 
is not too expensive. The other is a double integration approximated with 
a quadrature method which is efficient and not expensive. Therefore, we do 
not use any Markov chain Monte Carlo method and we considerably reduce 
the time and the complexity of the method. This allows us to easily build 
an accurate Bayesian metamodel. Practically, we use 441 integration points 
to approximate (|28p and 1000 Monte-Carlo particles to approximate (f27|) . 
Therefore, we have 441000 call to the predictive density function (|25p . 

To avoid a prohibitive implementation, another approach has also been 
proposed in [1]. They adopt a Bayes linear formulation which requires only 
the specification of the means, variances, and covariance. See [5] for further 
details about the Bayes linear approach. The strength of this method is that 
its computational cost is low. Nonetheless, since it only focuses on poste- 
rior means and covariances, it does not provide the full posterior predictive 
distribution. 

Finally, we highlight the fact that our Bayesian procedure can be used 
to perform multi-fidelity analysis with more than 2 levels of code whereas 
the cost of the one presented in |12) is too high to allow for such analysis. 
We illustrate in Section [8] through an industrial case the great practical 
importance of using more than 2 levels of code. 

6 Toy examples 

We will present in this section some co-kriging metamodels using one-dimensional 
functions inspired by the example presented in pQ. For the following exam- 
ples, we will use a non-Bayesian co-kriging model - i.e. the one presented in 
[7] - but with a Bayesian estimation of the parameters (see Section 14. 2p and 
for the second example we will use a Bayesian co-kriging. 
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Furthermore, the correlation kernels are assumed to be: 



r t (x 



(k) 



x 



.(fc) 



,(0||2' 



t) = exp 



where t, k, I = 1, 2 1 < i < ni 1 < j 1 < ri2- 

Example 1. The aim of this example is to emphasize the effectiveness of 
the presented Bayesian estimation of the parameters (see Section H~2|) . We as- 
sume that the expensive code is given by 22(2:) = (6x — 2) 2 sin(12x— 4) and the 
cheaper code by Z\ (x) = 0.5^2 (x) + 10 (x — 0.5) — 5. The experimental design 
set of the cheapest code is D x = {0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1} 
and the one of the expensive code is D 2 = {0,0.4,0.6, 1}. This example is 
identical to the one-dimensional demonstration presented in [5] . Figure [1] 
shows the functions x *— > Z2(x) and x t— > Z\(x), the training data for zi and 
z\, the ordinary kriging using only the expensive data and the co-kriging us- 
ing expensive and cheap data. To validate the model, the Root-mean-square 



15 - z 2 (x) 

z,(x) 

■ ■ ■ ■ co-kriging 

- - ordinary kriging 

10- A 22 



-10 




— I 1 1 1 1 T" 

0.0 0.2 0.4 0.6 0.8 1.0 



Figure 1: Toy examples. The co-kriging metamodel is very close to the 
expensive output z%{.) and improves significantly the ordinary kriging meta- 
model using the small design D%. 



errors (RMSE) and Q 2 = 1 - E " eT (™ z * ix) Z2{x) ) are computed. 

T. x eT{ m z 2 ( x )-z2) 

The test set T is composed of a regular grid points sampled from to 
1 with a grid step equal to 0.01 and z~2 is the empirical mean evaluated in 
T. The estimated RMSE is 5.68 x 10~ 2 and the coefficient Q 2 is 99.98%, 
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so we have a prediction error closed to 0. The Bayesian estimation of the 
parameters of co-kriging are given in Table [TJ Furthermore, the estimations 
of the hyper-parameters (#1, 62), calculated by maximizing the concentrated 
log-likelihoods {UJ and (J2HD, are #1 = 0.25 and § 2 = 0.80. D x being a regular 
grid with a grid step equal to 0.1 and D2 being composed of points sampled 
from to 1, points of the experimental designs are hence strongly correlated 
which will imply a smooth surrogate model. 



Regression Coefficient 


Estimation 


Variance Coefficient 


Estimation 


P 


2 


4 


32.75 


h 


(20, -20) 


4 


7.02 x 10~ 30 


Pi 


-3.49 







Table 1: A co-kriging example with one- variable functions. Bayesian esti- 
mation of parameters. 

We see that the Bayesian estimation of parameters is very effective since 
the estimations of parameters p and $2 are perfect. Nevertheless this example 
does not highlight the strength of the method since there is a relation between 
z 2{x)xe[o,i] an d z i( x )x£[o,i] which exactly corresponds to the equation (J2|) 
with the error 62 that can be written in terms of the regression functions /2 
exactly. Therefore, if the cheap code is well modelled, like in our case, the 
co-kriging is equivalent to a linear regression. Moreover, the very small value 
of a\ illustrates this. 

Example 2. This example illustrates a case where the non-Bayesian co- 
kriging underestimates the predictive variance whereas the Bayesian one ad- 
justs it. We assume that the expensive code is given by Z2 (x) = (6x — 
2) 2 sin(12x — 4) + sin(10cos(5x)) and the cheaper code is given by z\(x) = 
0.5((6x-2) 2 sin(12x-4)) + 10(x-0.5)-5. Through the term sin(10 cos(5x)), 
the expensive code has high frequencies which are not captured by the cheap 
code and the error 82 is not a simple linear combination of the regression 
functions f2- Figure [2] shows the results of kriging and co-kriging for these 
two functions. The estimated RMSE is 1.05 and the coefficient Q2 is 93.57%, 
we still have a good prediction. The Bayesian estimations of the parameters 
are given in Table [2] and we have 9\ = 0.25 and 62 = 0.07. The values of 
these parameters have been fixed according the following arguments. As the 
cheap code is the same as the one of the Example 1, we keep the same esti- 
mation for 9\. Then, we consider that there are not enough points to carry 
out a significant estimation of 62- Therefore, we fix the value of 62 according 
to the high frequencies introduced by the term sin(10cos(5x)). 
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0.0 0.2 0.4 0.6 0.8 1.0 



x 

Figure 2: Toy examples. The high frequency components of the expensive 
code are not predicted since they are not captured by the cheap code and 
the coarse grid used for the expensive code cannot detect them either. Nev- 
ertheless, the co-kriging improves the ordinary kriging metamodel since the 
cheap code allows us to predict the low frequencies of the expensive code 
accurately. 



Regression Coefficient 


Estimation 


Variance Coefficient 


Estimation 


P 


1.86 




32.75.03 


h 


(18.39,-17.00) 




0.30 




-3.49 







Table 2: A co-kriging example with one-dimensional functions. Bayesian 
estimation of parameters. 

Due to the additional term sin(10 cos(5a;)), the estimation of the parame- 
ter p is less effective than in the first example. This highlights the dependence 
between the estimation of p and the mean of <5(x) 3;6 [o i i] ■ Furthermore, Figure 
[3] represents the confidence interval at plus or minus twice the standard de- 
viation of the predictive distribution in the Bayesian and non-Bayesian case. 
We see that we underestimate the variance of the predictive distribution in 
the non-Bayesian case. This estimation is well adjusted in the Bayesian case. 
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Figure 3: A toy example without any prior information. The thick dot- 
ted line represents the prediction mean, the thin dotted lines represent the 
confidence interval at plus or minus twice the standard deviation in the non- 
Bayesian case and the dashed lines represent the same confidence interval in 
the Bayesian case. 



7 The case of s levels of code 



The aim of this Section is to perform a multi- level co-kriging with any number 
of codes. Let us consider s levels of code. The generalization of the previous 
model is straightforward. Actually, if we note /3 = (f3f , . . . , ff) T , p = 
(pi, . . .,p s -i), cr 2 = (o-f, ■•• ,<r?) and 9 = (6>i, . . . ,9 S ), we have: 



where: 
and: 



Vx G Q [Z s (x)\Z = z,(3,p,a 2 ,9]~N (m Zs (x) , s\ (x)) 
m Zs (x) = ti s (x) T p + t s {x) T V- l (z - H s f3) 



(29) 
(30) 



s 2 Zs { x ) = (j2 Zs- t s{x) T V-H s {x) 

Furthermore, let us denote by Rt = Rt(Dt) the correlation matrix for Dt 
and p s = 0, Vs < 0. The matrix V s has the form: 



V, 



( yC 1 ' 1 ) . . . \ 



(31) 
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The s diagonal blocks of size nt x m are defined by: 



V™ = a 2 R t {D t ) + a 2 t _iP 2 t-iRt-i{D t ) + • • • + a\ J] Pi #i(A) (32) 



and the off-diagonal blocks of size nt x n%i are given by: 

V M = AO !<*<*'< 



(33) 



The vector i s (a;) is such that t s (x) = {t\(x,DiY , . . . , t*(x, D S ) T ) T , where: 



5-1 



t* t {x,D t f = Pt - 1 ti_ 1 (x,D t ) T + a tRt{x,D t ) Kt<s (34) 



\T /Vrs-1 



where ( \\*Z L S Pi ) = 1 and q(x, A) r = ( fJS Pi ) <t?1*i(s, A)- If we define: 

/ 4 T (4°) 



1 < fc,Z < s 



then the matrix H s can be written as: 



P\F 1 {D 2 ) 
pip 2 Fi(D 3 ) 



F 2 (D 2 ) 
P2F 2 (D 3 ) 



(35) 



\{U^iPi)F 1 (D s ) [U°Z 2 Pi) F 2 (D S ) ... F S {D S ) ) 
h' s (x) T and vdtr(Z s (x)) = a\ s are given by the equations ([3]) and ([5|). 

7.1 Bayesian estimation of parameters for s levels of code 

From the assumptions of conditional independence between {5t{x)) x ^Q and 
(Zt-i(x), . . . , Zi(x)) X £Q, we can extend the Bayesian estimation of the pa- 
rameters to the case of s levels. Note that we do not assume the independence 
of fit and pt~i- We can obtain a closed form expression for the estimation of 
(/3j, pt-i)- For all t = 2, . . . , s, we have: 



*t_i,0 t ,<] ~ M([HiR^{D t )H t ) H L t R^\D t )ztM { H t R t (D t )H t ) 

(36) 

where iT 4 = [p t _iz t _i(A) F t (D t )]. Furthermore, if we note X t = E[(p t _i, (3 t )\zt, z t ~i,O t , 
then we have: 



[cr 2 \z t , z t -i,9 t ] ~ lG(a t , ^) 



(37) 
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where a t = r*=Bt=± and Q t = (z t - H t \ t ) T R^(D t )(z t - H t \ t ). 

The REML estimator of a\ is a\ = and we can estimate Ot by mini- 
mizing the expression: 

log(|det(22t(D t ))|) + (n t - p t - g^logfo 2 ) (38) 

The generalization of the Bayesian estimation previously presented is impor- 
tant since it shows that the parameters estimation for a s-levels co-kriging 
is equivalent to the one for s independent krigings. 

7.2 Reduction of computational complexity of inverting the 
covariance matrix V s 

V s is an (2i=i n i x Si=i n i) matrix, its inverse can hence be difficult to 
process. We present in this Subsection a method to reduce the complexity 
of the processing of V, -1 . By sorting the experimental design sets such that 

Vt = 2, . . . , s, A-i = • • • . 4~Sn t ,4\- ■ ■ = (A-i \ A, A) 

it can be shown that Vt = 2, . . . , s the inverse of the matrix V, has the form: 




Vr 1 = l u Pt-i^r ) \ Ps-i- 

R7 1 



vr 1 



01 



/ 

(39) 

with V s \ an (2? =1 fi{ x ^i=i re matrix and R s 1 an (n s x n s ) matrix. 
This is a very important result since it shows that we can deduce V~ l from 
R^ 1 , t = 1, ...,s. Therefore, the complexity of the processing of V, -1 is 
0(^2i = X n i) mstead of C((X^i=i n «) 3 )- Furthermore, from the equation (|39p 
and the Bayesian estimation of parameters presented in Section I7.1| we have 
shown here that building a s-level co-kriging is equivalent to build s inde- 
pendent krigings. We emphasize that, for practical applications, the form 
(|39p for the inverse of V s allows us to perform fine matrix regularization 
in the case of ill-conditioned problems. Indeed, V s is invertible if and only 
if the matrices Rt, t = l,...,s are invertible. Therefore, if the problem 
is ill-conditioned, we just have to regularize the matrices Rt which are ill- 
conditioned too. Then, since A) T , • • • , ^^(x, D S ^\) T ) = p s _itj_ 1 (x) 
it can also be shown that in the equation (|29|) : 



UixfV- 1 = ( p a -i£_i(aOK-i " [ ix(E s_ i TH-n B )>Ps-i R *({ x }i D s)Rj 1 ]A({x}, D S )RJ 1 

(40) 

Therefore, t s (x) T V s 1 is independent of cr 2 . Since ti(x) T V 1 1 = Ri({x}, D\)R 1 1 
does not depend on af, by induction, t s (x) T V~ 1 is independent of of for all 
1 < % < s. We have just shown here that the co-kriging mean does not 
depend on the variance coefficients. 
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7.3 Numerical test on the reduction of computational com- 
plexity 

In the previous section, we have presented a reduction of complexity for the 
co-kriging model by expressing the inverse of the matrix V s with the inverses 
of the matrices Rt, t = 1, . . . , s. We present here a numerical test to highlight 
the gain of CPU time obtained with this method. We focus on the case of 2 
levels of code and we consider the Gaussian kernel for the 2 levels: 

r(x — x ; 6) = exp 

The experimental design set for the cheap code is a regular grid com- 
posed of n\ points between and 1 and the experimental design set for the 
expensive code are the n 2 first points of this grid. We consider the relation 
n\ = An 2 with n 2 = 50, 60, . . . , 500 and the parameter 6 = ^ (the parameter 
9 is controlled by n 2 in order to avoid ill-conditioned covariance matrices). 
The total number of observations is hence n = n\ + n 2 . Figure U] compares 
the CPU time needed to build a co-kriging model with or without reduction 
complexity. 

First, the slope of the two CPU times is close to 3 (the least-squares 
estimation value is 3.03). The complexity of a matrix inversion being 0(n 3 ), 
with n the size of the matrix, the estimation of the slope highlights the fact 
that it is the matrix inversion which leads the CPU time. Then, Figure 
[4] emphasizes that the reduction of complexity is worthwhile. Indeed, we 
see that the ratio between the two CPU time is approximately a constant 
equal to 1.93. We are hence close to the theoretical ratio equal to (n\ + 
^2) 3 /( n i + n 2 ) ~ 1-92 which is obtained when we consider that the CPU 
time is essentially due to the matrix inversion. 

7.4 Toy example on the complexity reduction 

A 3-level co-kriging metamodel is presented in this section to illustrate the 
gain of CPU which can be obtained with the presented reduction of complex- 
ity. We focus on the inversion of the co-kriging matrix V s by comparing the 
CPU time needed with a direct inversion or by using the formula (|39p . We 
assume that the 3 levels of code are given by the followings three dimensional 
functions: 

Z\ (x) = sin(xi) (41) 
z 2 (x) = z\(x) + asin(x 2 ) 2 (42) 
zz{x) = z 2 (x) + bx\sm(xi) (43) 

with x = (xi,X2)X%) E [— vr,7r] 3 , a = 7 and b = 1/10. We note that the 
complex function z%(x) corresponds to the Ishigami function which is very 
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Figure 4: CPU time comparison between 2 level co-kriging models. The tri- 
angles represent the CPU time for the crude co-kriging model and the circles 
represent the CPU time for the co-kriging model with the complexity reduc- 
tion. The gain of CPU time with the reduction complexity is approximately 
a factor equal to 1.93. 



popular in the field of sensitivity analysis |15| . We consider 713 = 50 obser- 
vations for the most accurate code Za(x), ri2 = 200 for the intermediate code 
and z\ = 400 for the less accurate code. All experimental design sets are 
randomly sampled from the uniform distribution. As presented in Section [3] 
we consider nested experimental designs Vi = 2, . . . , s Dt Q D 

We use a tensorised Matern-| kernel for the three correlation functions: 

d 

r t (x, x'; 6 t ) = r 1D {x h x[; 9 t ,i) (44) 
i=i 

with r 1D (t , t'; 6) = (l + + § exp (- v^^) , t, f € R. 

The estimations of the hyper-parameters 0t are presented in Table [3l 
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Parameter 


Estimation 


9i 


( 0.61 1.99 2.04 ) 


2 


( 1.98 0.26 2.48 ) 


O3 


( 0.23 0.89 0.21 ) 



Table 3: Toy example on the complexity reduction. Estimation of the hyper- 
parameters (correlation lengths) for the 3-level co-kriging. 



The hyper-parameter estimates show us that Z\{x) is very smooth in 
the directions X2 and X3 reflecting the fact that it depends only on the first 
direction x\. Similarly, the bias between z% (x) and z\{x) only depending 
on the second direction X2, it is rougher on this direction and very smooth 
in the others. Finally the bias between 23 (x) and Z2 (x) is rougher in the 
direction x$ than in the directions x\ and X2- This is due to the important 
impact of X3 on the third level. 

The estimation of the variance, scale and regression parameters are given 
in Table H 



Parameter 



Pi 

P2 

h 



Estimation 



0.00 
0.99 
2.44 
0.95 
0.64 
0.09 
1.66 
6.25 



Table 4: Toy example on the complexity reduction. Estimation of the vari- 
ance, scale and regression parameters for the 3-level co-kriging. 

Table [4] shows the efficiency of the suggested method for the parameter 
estimations since it provides very accurate estimations of p\ and p2- 

To evaluate the accuracy of the co-kriging model, we use a test set of 
30,000 points uniformly sampled from the uniform distribution. Then, we 
compute the coefficient Q2 with the co-kriging predictions and the responses 
of z%{x) on this set. We obtain for the co-kriging model Q2 = 83.21%, 
we hence have a middling accuracy despite the large number of observa- 
tions used. Nonetheless, we have a significant improvement relatively to 
the kriging model since with the same kernel and the same experimental 
design set -D3 we obtain Q2 = 47.97% which is a very poor accuracy. The 
hyper-parameter estimation of the kriging model is 9 = (0.79, 0.14, 0.29), the 
variance one is a 2 = 13.66 and the trend coefficient one is /3 = 3.89. 
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Let us now compare the difference of CPU time between the co-kriging 
building with a crude inversion of the covariance matrix V, and the one with 
an inversion using the formula presented in Subsection 17.21 The CPU time 
necessary without the reduction complexity is CPU cru( je = 0.47 whereas the 
one necessary with the complexity reduction is CPUiight — 0.14. We hence 
find that the CPU time ratio between the two methods approximately equals 
3.36. This is not far from the theoretical ratio which equals 650 3 /(400 3 + 
200 3 + 50 3 ) ~ 3.80. We note that the complexity reduction could be of 
important practical interest. For example, without it the computational cost 
of a leave-one-out cross validation procedure will be much more important 
(the ratio will still be around 3 in our example). The complexity of this 
procedure being 0(n 4 ), the gain of CPU time will be substantially. 

8 Example : Fluidized-Bed Process 

This example illustrates the comparison between 2-level and 3-level co-kriging. 
A 3-level co-kriging method is applied to a physical experiment modelled by 
a computer code. The experiment, which is the measurement of the temper- 
ature of the steady-state thermodynamic operation point for a fluidized-bed 
process, was presented by [2], who developed a computer model named "Top- 
sim" to calculate the measured temperature. The code, developed for a Glatt 
GPCG-1, fluidized-bed unit in the top-spray configuration, can be run at 3 
levels of complexity. We hence have 4 available responses: 

1. T exp : the experimental response. 

2. T3: the most accurate code modelling the experiment. 

3. T2: a simplified version of T3. 

4. T\\ the lowest accurate code modelling the experiment. 

The differences between T±, T2 and T3 are discussed by Dewettinck et al. 
(1999). The aim of this study is to predict the experimental response T exp 
given the two levels of code T3 and T2. We only focus on a 3-level co-kriging 
using T3 and T2 to predict T exp since 28 responses available for each level 
is not enough to build a nested experimental design relevant for a 4-level 
co-kriging. The experimental design set and the responses T±, T2, T3 and 
T exp are given by |12| who have presented a 2-level co-kriging using T exp and 
T2. Furthermore, the responses are parameterized by a 6-dimensional input 
vector presented by Dewettinck et al. (1999). 

8.1 Building the 3-level co-kriging 

To build the 3-level co-kriging, we use 10 measures of T exp (measures 1, 3, 
8, 10, 12, 14, 18, 19, 20, 27 in Table 4 in [12]), 20 simulations of T 3 (runs 
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1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 18, 19, 20, 22, 24, 27) and the 
28 simulations of Ti and the input vector is scaled between and 1. The 
last 18 measures of T exp are used for validation. The design sets are nested 
such that Dt-\ = (Dt-i \ Dt,Dt) for t = 2, 3 and we use a Matern| kernel 
for the three covariance functions. The estimations of the hyper-parameters 
which represent correlation lengths of the three covariance kernels are given 
in Table [SJ 



0i 


1.790 


3.988 


1.218 


1.790 


3.595 


0.722 


02 


1.810 


1.842 


2.008 


1.036 


0.001 


0.345 


03 


0.890 


0.721 


2.008 


2.952 


1.790 


0.241 



Table 5: Example: fluidized-bed process. Estimation of the hyper- 
parameters (correlation lengths) for the 3-level co-kriging. 

The estimations of hyper-parameters in Table show us that the sur- 
rogate model will be very smooth in the first four directions. For the fifth 
direction the Gaussian processes modelling the cheap code T2 and the bias 
between T exp and T3 are very smooth and the one modelling the bias be- 
tween T3 and T2 is close to a regression. Finally, the model is sharper in the 
sixth direction in particular for the two biases where correlation lengths are 
around 0.3. 

Furthermore, Table [6] gives the estimation of the variance and regression 
parameters (see section 17. ip . 



Posterior (Jovariance 



Regression coefficient 



Posterior mean 



Pi 

h 
Ppi 
h 



47.02 
0.97 

-0.17 
0.95 
1.93 



0.134 



0.001 
-0.034 

0.003 
-0.121 



-0.034 
1.610 

-0.121 
5.188 



Variance coefficient 



Qt 
1032 
5.30 
8.39 



at 
13.5 
9 
4 



Table 6: Example: fluidized-bed process. Bayesian estimation of the variance 
and regression parameters for the 3-level co-kriging. 

Table[S]shows that the responses have approximately the same scale since 
the adjustment coefficients are close to 1. Furthermore, we see an important 
bias between T3 and T2 with /3s = 1.93. Finally, the variance coefficients for 
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the biases indicate that they are possibly much simpler to model than the 
cheap code T2 as their estimations are smaller. 

8.2 3-level co-kriging prediction: predictions when code out- 
put is available 

The aim of this Section is to show that co-kriging can improve significantly 
the accuracy of the surrogate model at points where at least one level of 
responses is available. 

The predictions of the 3-level co-kriging are here presented and compared 
with the predictions obtained with a 2-level co-kriging using only the 10 re- 
sponses of T exp and the 20 responses of T3. The predictions for the 2-level 
and the 3-level co-krigings vs. the real values (i.e., the measured tempera- 
ture T exp ) are shown in Figure [5) The 3-level co-kriging gives us the same 




Figure 5: Predictions of the 2-level and the 3-level co-krigings for the 
fluidized-bed process. The 3-level co-kriging improves significantly the pre- 
dictions of the 2-level one. 



prediction means as the 2-level co-kriging at the 10 points (points 2, 5, 6, 
7, 9, 11, 13, 16, 22, 24) where T3 is known. These overlapped points mean 
that T2 does not influence the surrogate model at these points. This follows 
from the Markov property introduced in Section O which implies that the 
prediction of T exp is entirely determined by T3 at these points. We also note 
that, in general, the 2-level co-kriging predictions - at points where T3 is 
unknown - are not accurate and the 3-level co-kriging improves significantly 



23 



the prediction means compared to the 2-level co-kriging. Table [7] compares 
the 2-level co-kriging with the 3-level co-kriging and summarizes some results 
about the quality of the predictions on the 18 validation points. Nonetheless, 
it is important to notice that, in the 3-level case, the output of the cheapest 
code T2 is known at the 18 test points. This means that the results of this 
subsection show that the 3-level co-kriging prediction is more accurate than 
the 2-level co-kriging prediction at a point where the cheapest response T2 
is available. In the next subsection we show that the 3-level co-kriging pre- 
diction is more accurate than the 2-level one at a point where no response is 
available. 







Q2 


RMSE 


MaxAE 


2-level co-krij 


;ing 


61.23 % 


4.24 


14.04 


3-level co-krij 


;ing 


98.71 % 


0.89 


1.98 






Average Std. dev. 


Median Std. dev. 


Maximal Std. dev 


2-level co-krij 


;ing 


2.90 


1.02 


5.68 


3-level co-krij 


;ing 


0.90 


1.02 


1.04 



Table 7: Example: fluidized-bed process. Comparison between 2-level co- 
kriging and 3-level co-kriging. Predictions are better in the 3-level case and 
the prediction variance seems well-evaluated since the RMSE and the average 
standard deviation are close. 

Figure [6] shows the prediction errors of the 2-level co-kriging and the 
confidence interval at plus or minus twice the prediction standard deviation. 
The last 10 prediction errors and their confidence intervals are the same 
as those of the 3-level case since it corresponds to the points where T3 is 
known. We see in Figure [6] that the confidence intervals are well predicted. 
Furthermore, we see a significant difference between the accuracy of the 
prediction means and their confidence intervals for the point where T3 is 
unknown (the 8 first validation points) and for the ones where it is known 
(the last 10 validation points). 

8.3 3-level co-kriging prediction: predictions when code out- 
put is not available 

In this subsection, we show that a multi-level co-kriging can significantly 
improve the prediction of a surrogate model at points where no response is 
available. 

We have seen in Section 18.21 that the 3-level co-kriging improves signif- 
icantly the 2-level co-kriging at points where T3 is unknown and T2 has 
been sampled. Nevertheless, to have a fair comparison between these two 
co-kriging models, we compare their accuracy by applying a Leave-One-Out 
Cross- Validation (LOO-CV) procedure at the 10 points where T exp is known. 
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oj - A Prediction error 

—\— Confidence interval at 95% 



5 10 

Validation point 



Figure 6: Prediction errors of the 2-level co-kriging and confidence intervals 
at plus or minus twice the standard deviation. We see a significant difference 
between the accuracy of the predictions means and their confidence intervals 
for the point where T3 is unknown (the 8 first validation points) and for the 
ones where it is known (the last 10 validation points). 



This means that we perform for each of these 10 points the following proce- 
dure: 

1. The experimental and the two code outputs corresponding to the point 
are removed from the data set. 

2. The 2-level co-kriging method and the 3- level co-kriging method are 
applied using the truncated data set in order to give a confidence in- 
terval for the experimental output at the point. 

Figure [7] shows the result of the LOO-CV procedure for the 2-level and 3- 
level co-kriging. We see that the 3-level co-kriging is more accurate than the 
2-level one. Indeed, the LOO-CV RMSE for the 2-level co-kriging is equal 
to 1.88 whereas it is equal to 1.09 for the 3-level co-kriging. This shows that 
the 3-level co-kriging provides better predictions also at points where no 
response is available. This highlights the strength of the proposed method 
and shows that a co-kriging method with more than 2 levels of code can be 
worthwhile. 
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o 3-level co-kriging LOO-CV error 
a 2-level co-kriging LOO-CV error 

3-level co-kriging LOO-CV confidence interval at 95% 

2-level co-kriging LOO-CV confidence interval at 95% 
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Figure 7: Leave-One-Out Cross- Validation predictive errors and variances 
of the 2-level and 3-level co-kriging. We see that the confidence intervals 
are accurate and the precision of the 3-level co-kriging is significantly better 
than the one of the 2-level co-kriging. 

9 Conclusion 

We have presented a method for building kriging models using a hierarchy 
of codes with different levels of accuracy. This method allows us to improve 
a surrogate model built on a complex code using information from a cheap 
one. It is particularly useful when the complex code is very expensive. We 
see in our literature review that the first multi-level metamodel originally 
suggested is a first-order auto-regressive model built with Gaussian processes. 
The AR(1) relation between two levels of code is natural and the building 
of the model is straightforward. Nevertheless, we have highlighted some key 
issues which makes it difficult to use this model in practical ways. 

First, important parameters of the model, which are the adjustment 
coefficients between two successive levels of codes, were numerically esti- 
mated. We propose here an analytical estimation of these parameters with 
a Bayesian method. This method allows us to have information about the 
uncertainties of the estimations and above all, to easily use the AR(1) model 
and its generalization to the case of non-spatial stationarity. Furthermore, a 
strength of the proposed method is that it even works for a code with more 
than 2 levels since its implementation is such that the estimations of the 
parameters of a s-level co-kriging is equivalent to the ones of s independent 
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krigings. It is important to highlight that this method is based on a joint 
estimation between the adjustment coefficient and the mean of the Gaussian 
process modelling the difference between two successive levels of code. 

Second, we have seen that the variance of the predictive distribution 
of the AR(1) model could be underestimated. A natural approach to im- 
prove this estimation is a Bayesian modelling. We propose here a Bayesian 
co-kriging for 2 levels of code and to avoid computationally expensive im- 
plementation, we suggest another model than the one presented. This new 
model is based on a hierarchical specification of the parameters of the model. 
This allows us to have a Bayesian model including only two nested integra- 
tions without Markov chain Monte Carlo procedure. 

Finally, for a non-Bayesian s-level co-kriging, we have proved that build- 
ing a s-level co-kriging is equivalent to build s independent krigings. This 
result is very important since it solves one of the most important key issues 
of the co-kriging which is the inversion of the covariance matrix. A 3-level 
co-kriging example has been provided to show the efficiency of the presented 
method. 
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A The case of p depending on x 
A.l Building a model with s levels of code 

Let us consider s levels of code, if we note f3 = (flf, ■ ■ ■ , f3j) T , /3 p = {0p\ , . . . , /3j a _ 1 ) T , 
a 2 = {a 2 ,..., a 2 ) and 9 = (0i,...,0 s )> we have [Z s (x)\Z = z,P,P p ,a 2 ,9] ~ 
J\f [mz s (x), s 2 7 s (x)) where mz s { x ) an d s z s ( x ) are defined in equations (|29p 
and (|30p . Let us define the notation Oi=fc = • • • M where rep- 
resents the matrix element-by-element product. Furthermore, let us denote 
by Pt = Pt(Dt) the vector containing the values of Pt(%), x £ Df. The s 
diagonal blocks of V s (f3T|) of size n t x rit are defined by: 

= a t 2 i? t (A)+^_i {p t -i(D t )pt 1 (D t ))QRt-i(D t )+. • -+a 2 (o Pi (D t ) pj (A)j QRi(D t 
and the off-diagonal blocks of size nt x n t i are given by: 

= (l nt fOft(A')l \ VW(D U D V ) l<t<t'<s 



. i=t 
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The vector t s (x) in equations ([29]) and ()30|) is such that t s (x) = [t\(x, D\) T , . . . , t*(x, D S ) T ) T , 
where: 



i's-I 



where 1< t < s, (n^ 1 = 1 and t\(x, D{f = [J\°z{ Pi (x)J a^R^x, D ± ). 

Furthermore, the matrix H s in equations 1351 can be written as: 

/ : 

((Otl Pi(Dj)) 1^) Q F 1 (D j ) ((©£ ft(A0l£)) F 2 {D 3 ) ... 

V 



H 



A. 2 Bayesian estimation of parameters for s levels of code 

We can extend the Bayesian estimation of the parameters to the case of p 
depending on x. Note that we do not assume the independence of fit and 
fipt-i- We have: 



[OV^A)!*,**-!,^] ~ M {(Hi Ri\D t )H t ) HtR- t \D t )z t y t (H* R^\D t )H, 
where H t = [F /9( _ 1 (D t ) (z t -x (A) ) i*t(A)]- Furthermore, we have: 



where 



nt-pt- qt-i 
2 



Qt = & - A A t ) R^\D t )(z t - H t X t 



X t = E[(P pt _ 1 ,l3 t )\z t ,zt-i,e t ,aj} 

The REML estimator of <r 2 is <5" 2 = J**- and we can estimate by minimizing 
the expression: 

log(|det(A(A))|) + K — Pt~ g t _i)log(ff t 2 ) 



A. 3 Some important results about the covariance matrix V, 

By sorting the experimental design sets as in Subsection 17. 2\ it can be shown 
that Vi = 2, . . . , s the inverse of the matrix V s has the form: 



VI 



( ! ( 

O s _i(A>J-i(A)) ^jjr 



V 



(l^p^A)) ^ 



Hr 1 





( Pa _i(i>.)i£ > ) © 
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with V 1 1 = V a _\ an (X^i=i n « x Xa=i n «) matrix and i? s 1 an (n s x n s ) 
matrix. It can also be shown that: 

A. 4 Bayesian prediction for a code with 2 levels 

The equations for the Bayesian prediction when p depends on x can be 
directly derived from the Section [5] by replacing p with (5 p and noting that 
the design matrix F is such that: 

F=[F p (D 2 )Q( Zl (D 2 )ll) F 2 ] 
Finally, for the Bayesian prediction, we just have to adapt the integral 

p(z 2 (x)\zi,z 2 ,al,al) = / p(z 2 (x)\zi, z 2 , /3 2 , /3 p ,al,a 2 )p(f3 p , /3 2 \z 1 , z 2 , a 2 ) d(3 p d/3 2 

JRPP+P2 
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